Python-Style Regular Expressions

Introduction

A regular expression, usually abbreviated "re", "regex" or "regexp", is a language that describes text patterns. Regular expressions are similar to Unix-style wildcards, except that regular expressions are much more powerful and the syntax is different.

Different styles of regex languages exist that are somewhat but not entirely compatible. Some examples are Perl, Python, and Unix (sed, awk, grep, ksh, etc.).

PFrank supports python-style regular expressions for search pattern matching in the name filter patterns and in the custom renaming search patterns. In order to obtain a pattern match for a name, only part of the name needs to be matched.
Examples of regular expressions for matching with file/folder names are:

Search Pattern	Action
.*	Match filenames that have 0 or more occurrences of any character. This essentially finds all filenames.
\.	Match filenames that have a dot in them.
\.+	Match filenames that have sequences of dots, where each sequence has one or more dots.
[^\.]*\.[^.]{3,3}$	Match filenames that have 0 or more occurrences of any character (other than a dot), followed by a dot, followed by 3 characters (other than a dot) at the end of the name. This finds all filenames with a 3 character extension.
a	Match filenames that have an 'a' in them.
abc	Match filenames that have 'abc' in them.
\.avi	Match filenames that have one or more occurrences of '.avi' in them.
\.avi$	Match filenames that end in '.avi'.
\.(avi|mpg)$	Match filenames that end in '.avi' or '.mpg'.
(?xi) .*[.](jpg|jpeg|gif)$	Match filenames that end in '.jpg' or '.jpeg' or '.gif'. The letters can be lower or upper case.
^abc	Match filenames that start with 'abc'.
[0-9]	Match filenames that have a numerical digit.
[0-9][0-9]+	Match filenames that have one or more occurrences of numerical digit sequences where each sequence has at least 2 digits.
^[0-9]{1,3}abc	Match filenames that start with 1-3 numerical digits followed by 'abc'.
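
The "only part of the name needs to be matched" behaviour described above corresponds to what Python's re.search does. The following is a minimal sketch under that assumption; how PFrank itself calls the regex engine is not stated here, and the sample names and the helper matching() are made up for the illustration.

```python
import re

# Hypothetical sample names, used only for this illustration.
names = ["holiday.avi", "notes.txt", "clip.AVI", "12abc.mpg", "abc123"]

def matching(pattern, candidates):
    """Return the names in which `pattern` is found anywhere (partial match)."""
    return [name for name in candidates if re.search(pattern, name)]

ends_in_avi = matching(r"\.avi$", names)                  # case-sensitive
avi_or_mpg = matching(r"\.(avi|mpg)$", names)             # alternation inside a group
any_case = matching(r"(?xi) .*[.](avi|mpg)$", names)      # x: ignore the space, i: ignore case
digits_then_abc = matching(r"^[0-9]{1,3}abc", names)      # anchored at the start

print(ends_in_avi)      # ['holiday.avi']
print(avi_or_mpg)       # ['holiday.avi', '12abc.mpg']
print(any_case)         # ['holiday.avi', 'clip.AVI', '12abc.mpg']
print(digits_then_abc)  # ['12abc.mpg']
```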

PFrank supports python-style regular expressions for replacement pattern substitution when performing custom renaming. When specifying a replacement pattern string, the replacement string only replaces the portion(s) of the filename that was matched by the search pattern string.

Examples of regular expressions for matching filenames and then replacing them are:

Search Pattern	Replacement Pattern	Action
a	A	Replace all occurrences of the letter 'a' with 'A'.
a+	a	Replace each run of one or more 'a' characters with a single letter 'a'.
abc	xyz	Replace all occurrences of the sequence 'abc' with 'xyz'.
%20	_	Replace all occurrences of the sequence '%20' with '_'.
abc		Delete all occurrences of abc. i.e. Replace all occurrences of abc with nothing.
[0-9]		Delete all occurrences of the numbers ranging from 0 to 9. i.e. Replace all occurrences of the numbers with nothing.
[aeiou]		Delete all occurrences of the letters 'a', 'e', 'i', 'o', or 'u'. i.e. Replace all vowels with nothing.
\(.*?\)		Delete all characters enclosed in parentheses (including the parentheses). i.e. Replace all occurrences with nothing.
(.*)	\1.jpg	Insert '.jpg' at the end of the name. The parentheses in '(.*)' of the search pattern identify a group in the regular expression search pattern. The first set of parentheses is group 1 and is referred to in the replacement pattern as '\1'. In this example, the entire filename is identified as group 1. To put a string after the group just set the replacement pattern to '\1' followed by the string you want to insert.
(.*)	Movie - \1	Insert 'Movie - ' at the beginning of the name.
(\s)0{2,2}([0-9]+)	\1\2	Delete 2 leading zeros from numbers that have at least 2 leading zeros. The search expression looks for whitespace followed by 2 zeros followed by any digits. The matching string is then replaced with the whitespace followed by the digits without the 2 leading zeros.
.{2,2}(.*)	abc\1	Replace the first 2 characters of the filename with abc.
(.{64,64}).*?(\..*)	\1\2	Take the first 64 characters of a filename prefix and append the extension. Any characters between the first 64 and the extension are removed.
(.{42,42}).*?(\..*)	\1\2	Take the first 42 characters of a filename prefix and append the extension. Any characters between the first 42 and the extension are removed.
.{2,2}(.*)	\1	Delete the first 2 characters of the filename.
.{3,3}(.*)	\1	Delete the first 3 characters of the filename.
(.*?).{2,2}$	\1	Delete the last 2 characters of the filename.
([^\.]*?)\..{2,2}([^\.]*?)$	\1.\2	Delete the first 2 characters of a filename's extension.
.{2,2}([^\.]*?)\.([^\.]*?)$	\1.\2	Delete the first 2 characters of a filename's prefix.
([^\.]*?)\.([^\.]*?).{2,2}$	\1.\2	Delete the last 2 characters of a filename's extension.
([^\.]*?).{2,2}\.([^\.]*?)$	\1.\2	Delete the last 2 characters of a filename's prefix.
([^-]*?)-([^\.]*?)\.([^\.]*?)$	\2-\1.\3	Swap the first 2 fields of a filename. In this case the name uses '-' as the field separator and has only 2 fields before the extension.
(.+)\.(.+)\.(PF[0-9]+)	\1_\3.\2	Move the third group (groups are enclosed in parentheses) after the first. Separate the third group from the first with an underscore character and separate the third from the second with a dot character. This can be used to move the suffix generated by the PFrank tool when it detects duplicates. The PF suffix in this case is moved to just before the extension of the filename.
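
A few of the rows above can be tried out with Python's re.sub, which replaces only the matched portion of a string. This is a sketch rather than a description of PFrank's implementation: the helper rename() and the sample filenames are hypothetical, and PFrank-specific flags are not modelled.

```python
import re

def rename(search, replacement, name):
    """Apply one search/replacement pair to a name (illustrative helper)."""
    return re.sub(search, replacement, name)

# Literal text replaced everywhere it occurs.
underscores = rename(r"%20", "_", "my%20holiday%20video.avi")

# Escaped parentheses match literal '(' and ')'; an empty replacement deletes.
no_parens = rename(r"\(.*?\)", "", "movie (1999).avi")

# \1 and \2 copy the whitespace and the digits, dropping the two zeros.
no_zeros = rename(r"(\s)0{2,2}([0-9]+)", r"\1\2", "track 007.mp3")

# Delete the first 2 characters by keeping only group 1.
trimmed = rename(r".{2,2}(.*)", r"\1", "01song.mp3")

# Move a trailing PF suffix to just before the extension.
moved = rename(r"(.+)\.(.+)\.(PF[0-9]+)", r"\1_\3.\2", "song.mp3.PF1")

print(underscores)  # my_holiday_video.avi
print(no_parens)    # movie .avi
print(no_zeros)     # track 7.mp3
print(trimmed)      # song.mp3
print(moved)        # song_PF1.mp3
```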

From the above examples one can deduce that not all characters are interpreted literally but that some of them have special meaning. These types of characters are called metacharacters. Here's a complete list of the metacharacters.

[ ] | ^ $ \ . * + ?  { } ( )

The following sections are a general tutorial on how to use regex expressions for string matching and for string replacement.

Meta Character Basics

Most letters and characters will simply match themselves. For example, the regular expression test will match the string "test" exactly.

There are exceptions to this rule; some characters are special, and don't match themselves. Instead, they signal that some out-of-the-ordinary thing should be matched, or they affect other portions of the regex by repeating them.

The "[" and "]" are used for specifying a character class, which is a set of characters that you wish to match. Characters can be listed individually, or a range of characters can be indicated by giving two characters and separating them by a "-". For example, [abc] will match any of the characters "a", "b", or "c"; this is the same as [a-c], which uses a range to express the same set of characters. If you wanted to match only lowercase letters, your regex would be [a-z]. Metacharacters are not active inside classes. For example, [akm$] will match any of the characters "a", "k", "m", or "$"; "$" is usually a metacharacter, but inside a character class it's stripped of its special nature. You can match the characters not within a range by complementing the set. This is indicated by including a "^" as the first character of the class; "^" elsewhere will simply match the "^" character. For example, [^5] will match any character except "5".
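
The character-class rules above can be checked with Python's re.findall, which returns every match in a string. The sample strings are invented for the illustration.

```python
import re

# A listed set and the equivalent range match the same characters.
listed = re.findall(r"[abc]", "a-b-c-d")
ranged = re.findall(r"[a-c]", "a-b-c-d")

# Inside a class, "$" is an ordinary character.
with_dollar = re.findall(r"[akm$]", "make $5")

# A leading "^" complements the class: everything except "5".
not_five = re.findall(r"[^5]", "5a55b")

print(listed)       # ['a', 'b', 'c']
print(ranged)       # ['a', 'b', 'c']
print(with_dollar)  # ['m', 'a', 'k', '$']
print(not_five)     # ['a', 'b']
```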

The "|" represents alternation, or the "or" operator. If A and B are regular expressions, A|B will match any string that matches either "A" or "B". | has very low precedence in order to make it work reasonably when you're alternating multi-character strings. Crow|Servo will match either "Crow" or "Servo", not "Cro", a "w" or an "S", and "ervo".
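
As a small check of the precedence rule, the sketch below uses Python's re module with made-up sample strings. If "|" bound more tightly, Crow|Servo would behave like Cro(w|S)ervo, which it does not.

```python
import re

# Only the complete alternatives are found, not fragments of them.
found = re.findall(r"Crow|Servo", "Crow, Servo, Cro, ervo")

# A string that the tighter reading Cro(w|S)ervo would accept is rejected.
loose = re.fullmatch(r"Crow|Servo", "CroServo")
tight = re.fullmatch(r"Cro(w|S)ervo", "CroServo")

print(found)              # ['Crow', 'Servo']
print(loose is None)      # True
print(tight is not None)  # True
```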

The "^" matches at the beginning of lines. For example, if you wish to match the word "From" only at the beginning of a line, the regex to use is ^From which would match the phrase 'From Here to Eternity' but will not match 'Reciting From Memory'.

The "$" matches at the end of a string. For example, End$ matches 'End' but does not match 'End '.
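
The two anchors can be tried with Python's re.search, using the same sample strings as the text. The helper found() is only a convenience for this sketch.

```python
import re

def found(pattern, text):
    """True if `pattern` matches somewhere in `text`."""
    return re.search(pattern, text) is not None

start_ok = found(r"^From", "From Here to Eternity")
start_no = found(r"^From", "Reciting From Memory")
end_ok = found(r"End$", "End")
end_no = found(r"End$", "End ")   # the trailing space prevents the match

print(start_ok, start_no)  # True False
print(end_ok, end_no)      # True False
```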

The "\" is one of the most important metacharacters. It can be followed by various characters to signal various special sequences. It's also used to escape all the metacharacters so you can still match them in patterns; for example, if you need to match a "[" or "\", you can precede them with a backslash to remove their special meaning: \[ or \\. Depending on the regex style, some of the special sequences beginning with "\" represent pre-defined sets of characters that are often useful, such as the set of digits, the set of letters, or the set of anything that isn't whitespace. The following pre-defined special sequences are available:

Sequence	Meaning
\A	Matches only at the start of the string.
\b	Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric characters, so the end of a word is indicated by whitespace or a non-alphanumeric character. Inside a character range, \b represents the backspace character.
\B	Matches the empty string, but only when it is not at the beginning or end of a word.
\d	Matches any decimal digit; this is equivalent to the set [0-9].
\D	Matches any non-digit character; this is equivalent to the set [^0-9].
\s	Matches any whitespace character; this is equivalent to the set [ \t\n\r\f\v].
\S	Matches any non-whitespace character; this is equivalent to the set [^ \t\n\r\f\v].
\w	When the `LOCALE` and `UNICODE` flags (see Regex Flags) are not specified, matches any alphanumeric character; this is equivalent to the set [a-zA-Z0-9_]. With `LOCALE`, it will match the set [0-9_] plus whatever characters are defined as letters for the current locale. If `UNICODE` is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.
\W	When the `LOCALE` and `UNICODE` flags (see Regex Flags) are not specified, matches any non-alphanumeric character; this is equivalent to the set [^a-zA-Z0-9_]. With `LOCALE`, it will match any character not in the set [0-9_], and not defined as a letter for the current locale. If `UNICODE` is set, this will match anything other than [0-9_] and characters marked as alphanumeric in the Unicode character properties database.
\Z	Matches only at the end of the string.

These sequences can be included inside a character class. For example, [\s,|] is a character class that will match any whitespace character, or "," or "|".
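
A few of the special sequences, including one used inside a character class, are illustrated below with Python's re module. The sample strings are invented, and only ASCII text is used so that the LOCALE and UNICODE distinctions do not matter.

```python
import re

# \d+ finds runs of decimal digits.
numbers = re.findall(r"\d+", "track 07 of 12")

# \w+ finds runs of letters, digits and underscores.
words = re.findall(r"\w+", "snake_case-name 42")

# \b restricts the match to whole words.
whole_words = re.findall(r"\bcat\b", "cat concat cat.")

# [\s,|]+ splits on runs of whitespace, commas or vertical bars.
fields = re.split(r"[\s,|]+", "a, b|c  d")

print(numbers)      # ['07', '12']
print(words)        # ['snake_case', 'name', '42']
print(whole_words)  # ['cat', 'cat']
print(fields)       # ['a', 'b', 'c', 'd']
```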

The final metacharacter in this section is the period. It matches "any character".

Using Meta Characters for Repeating Things

Being able to match varying sets of characters is the first thing regular expressions can do that isn't already possible with the methods available on strings. However, if that was the only additional capability of regexes, they wouldn't be much of an advance. Another capability is that you can specify that portions of the re must be repeated a certain number of times.

The first metacharacter for repeating things that we'll look at is *. * doesn't match the literal character "*"; instead, it specifies that the previous character can be matched zero or more times, instead of exactly once.

For example, ca*t will match "ct" (0 "a" characters), "cat" (1 "a"), "caaat" (3 "a" characters), and so forth. The regex engine has various internal limitations stemming from the size of C's int type, that will prevent it from matching over 2 billion "a" characters; you probably don't have enough memory to construct a string that large, so you shouldn't run into that limit.

Repetitions such as * are greedy; when repeating a re, the matching engine will try to repeat it as many times as possible. If later portions of the pattern don't match, the matching engine will then back up and try again with fewer repetitions.

A step-by-step example will make this more obvious. Let's consider the expression a[bcd]*b. This matches the letter "a", zero or more letters from the class [bcd], and finally ends with a "b". Now imagine matching this re against the string "abcbd".

Step	Matched	Explanation
1	`a`	The a in the regex matches.
2	`abcbd`	The engine matches [bcd]*, going as far as it can, which is to the end of the string.
3	Failure	The engine tries to match b, but the current position is at the end of the string, so it fails.
4	`abcb`	Back up, so that [bcd]* matches one less character.
5	Failure	Try b again, but the current position is at the last character, which is a "`d`".
6	`abc`	Back up again, so that [bcd]* is only matching "`bc`".
7	`abcb`	Try b again. This time the character at the current position is "`b`", so it succeeds.

The end of the regex has now been reached, and it has matched "abcb". This demonstrates how the matching engine goes as far as it can at first, and if no match is found it will then progressively back up and retry the rest of the regex again and again. It will back up until it has tried zero matches for [bcd]*, and if that subsequently fails, the engine will conclude that the string doesn't match the regex at all.
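
The outcome of the walk-through can be confirmed with Python's re.match. The engine's internal backtracking steps are not visible from the outside; only the final match is, so this sketch checks the end result.

```python
import re

m = re.match(r"a[bcd]*b", "abcbd")

matched_text = m.group(0)   # the text the whole pattern matched
matched_span = m.span()     # start and end offsets of the match

# With no "b" available to finish the pattern, backing up never helps.
no_match = re.match(r"a[bcd]*b", "acd")

print(matched_text)      # abcb
print(matched_span)      # (0, 4)
print(no_match is None)  # True
```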

Another repeating metacharacter is +, which matches one or more times. Pay careful attention to the difference between * and +; * matches zero or more times, so whatever's being repeated may not be present at all, while + requires at least one occurrence. To use a similar example, ca+t will match "cat" (1 "a"), "caaat" (3 "a"'s), but won't match "ct".

There are two more repeating qualifiers. The question mark character, ?, matches either once or zero times; you can think of it as marking something as being optional. For example, home-?brew matches either "homebrew" or "home-brew".

The most complicated repeated qualifier is {m,n}, where m and n are decimal integers. This qualifier means there must be at least m repetitions, and at most n. For example, a/{1,3}b will match "a/b", "a//b", and "a///b". It won't match "ab", which has no slashes, or "a////b", which has four.

You can omit either m or n; in that case, a reasonable value is assumed for the missing value. Omitting m is interpreted as a lower limit of 0, while omitting n results in an upper bound of a very large number.

You may have noticed that three other meta characters can all be expressed using this notation. For example, {0,} is the same as *, {1,} is equivalent to +, and {0,1} is the same as ?. It's better to use *, +, or ? when you can, simply because they're shorter and easier to read.
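
The repetition examples from this section can be verified with Python's re.fullmatch, which requires the whole string to match. The helper full() is just a convenience for this sketch.

```python
import re

def full(pattern, text):
    """True if the entire `text` matches `pattern`."""
    return re.fullmatch(pattern, text) is not None

star = [full(r"ca*t", s) for s in ("ct", "cat", "caaat")]
plus = [full(r"ca+t", s) for s in ("ct", "cat", "caaat")]
optional = [full(r"home-?brew", s) for s in ("homebrew", "home-brew")]
bounded = [full(r"a/{1,3}b", s) for s in ("ab", "a/b", "a//b", "a///b", "a////b")]

# {0,} behaves like *, {1,} like +, and {0,1} like ?.
star_equiv = [full(r"ca{0,}t", s) for s in ("ct", "cat", "caaat")]
plus_equiv = [full(r"ca{1,}t", s) for s in ("ct", "cat", "caaat")]

print(star)      # [True, True, True]
print(plus)      # [False, True, True]
print(optional)  # [True, True]
print(bounded)   # [False, True, True, True, False]
```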

Using Meta Characters for Grouping

Frequently you need to obtain more information than just whether the regex matched or not. Regular expressions are often used to dissect strings by writing a regex divided into several subgroups which match different components of interest. For example, an RFC-822 header line is divided into a header name and a value, separated by a ":". This can be handled by writing a regular expression which matches an entire header line, and has one group which matches the header name, and another group which matches the header's value.

Groups are marked by the "(", ")" metacharacters. "(" and ")" have much the same meaning as they do in mathematical expressions; they group together the expressions contained inside them. For example, you can repeat the contents of a group with a repeating qualifier, such as *, +, ?, or {m,n}. For example, (ab)* will match zero or more repetitions of "ab".

Groups are numbered starting with 0. Group 0 is always present; it's the whole re. Subgroups are numbered from left to right, from 1 upward. Groups can be nested; to determine the number, just count the opening parenthesis characters, going from left to right. For example, in the pattern (a(b)c)d, the string 'abcd' matches as group 0, the string 'abc' matches as group 1, and the string 'b' matches as group 2.
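
Group numbering can be inspected on a Python match object. The header-splitting pattern in the second half is an assumption made for this sketch (the text describes the idea but gives no pattern), and the sample header line is invented.

```python
import re

m = re.match(r"(a(b)c)d", "abcd")
whole = m.group(0)    # group 0: the entire match
outer = m.group(1)    # group 1: first opening parenthesis
inner = m.group(2)    # group 2: second opening parenthesis

# Hypothetical header pattern: a name, a colon, then the value.
header = re.match(r"([^:]+):\s*(.*)", "From: author@example.com")
name, value = header.groups()

print(whole, outer, inner)  # abcd abc b
print(name)                 # From
print(value)                # author@example.com
```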

Backreferences in a pattern allow you to specify that the contents of an earlier capturing group must also be found at the current location in the string. For example, \1 will succeed if the exact contents of group 1 can be found at the current position, and fails otherwise. (\b\w+)\s+\1 is a regex that detects double words in a string. For example, it would match 'the the' in the string 'Paris in the the spring'.
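
The doubled-word pattern can be tried directly in Python with the sample sentence from the text:

```python
import re

m = re.search(r"(\b\w+)\s+\1", "Paris in the the spring")

doubled = m.group(0)   # the whole match: the word, the gap, the repeat
word = m.group(1)      # the word captured by group 1

# A sentence without a repeated word gives no match.
clean = re.search(r"(\b\w+)\s+\1", "Paris in the spring")

print(doubled)        # the the
print(word)           # the
print(clean is None)  # True
```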

The power of backreferences becomes especially apparent when they are specified in replacement patterns.

Using Meta Characters for Non-capturing and Named Groups

Elaborate regex's may use many groups, both to capture substrings of interest, and to group and structure the regex itself. In complex regex's, it becomes difficult to keep track of the group numbers. There are two features which help with this problem. Both of them use a common syntax for regular expression extensions, so we'll look at that first.

The sequence (?...) is used as a syntax extension. The characters immediately after the "?" indicate what extension is being used, so (?=foo) is one thing (a positive lookahead assertion) and (?:foo) is something else (a non-capturing group containing the subexpression foo).

A "P" following the question mark is an indication that it's an extension that's specific to Python. Currently there are two such extensions: (?P<name>...) defines a named group, and (?P=name) is a backreference to a named group.

Now that we've looked at the general extension syntax, we can return to the features that simplify working with groups in complex regex's. Since groups are numbered from left to right and a complex expression may use many groups, it can become difficult to keep track of the correct numbering, and modifying such a complex regex is annoying. Insert a new group near the beginning, and you change the numbers of everything that follows it.

First, sometimes you'll want to use a group to collect a part of a regular expression, but aren't interested in retrieving the group's contents. You can make this fact explicit by using a non-capturing group: (?:...), where you can put any other regular expression inside the parentheses.

Except for the fact that you can't retrieve the contents of what the group matched, a non-capturing group behaves exactly the same as a capturing group; you can put anything inside it, repeat it with a repetition metacharacter such as "*", and nest it within other groups (capturing or non-capturing). (?:...) is particularly useful when modifying an existing group, since you can add new groups without changing how all the other groups are numbered. It should be mentioned that there's no performance difference in searching between capturing and non-capturing groups; neither form is any faster than the other.
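
The effect on numbering can be seen by comparing the groups reported for a capturing and a non-capturing version of the same pattern. The patterns and the sample string are invented for this sketch.

```python
import re

# Both patterns match the same text ...
capturing = re.match(r"(ab)+(c)", "ababc")
non_capturing = re.match(r"(?:ab)+(c)", "ababc")

# ... but only the capturing version assigns a number to the "ab" group.
print(capturing.groups())        # ('ab', 'c')
print(non_capturing.groups())    # ('c',)
print(non_capturing.group(0))    # ababc
```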

The second, and more significant, feature is named groups; instead of referring to them by numbers, groups can be referenced by a name.

The syntax for a named group is one of the Python-specific extensions: (?P<name>...). name is, obviously, the name of the group. Except for associating a name with a group, named groups also behave identically to capturing groups. Named groups are still given numbers, so you can retrieve information about a group in two ways. For example, to retrieve the first group of the search pattern

(?P<word>\b\w+\b)

you can specify \1 or (?P=word).

Named groups are handy because they let you use easily-remembered names, instead of having to remember numbers. Here's an example for specifying a day - month - year pattern:

(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-(?P<year>[0-9][0-9][0-9][0-9])
(?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])

It's obviously much easier to refer to the group named min than to remember that it is group 5.

Since the syntax for backreferences, in an expression like (...)\1, refers to the number of the group, there's naturally a variant that uses the group name instead of the number. This is also a Python extension: (?P=name) indicates that the contents of the group called name should again be found at the current point. The regular expression for finding doubled words, (\b\w+)\s+\1, can also be written as (?P<word>\b\w+)\s+(?P=word).
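
Named groups can be read back by name or by number in Python. In this sketch the two lines of the date and time pattern are joined with a single space, which is an assumption (the text does not say how the two halves are separated), and the sample timestamp is invented.

```python
import re

# Assumption: date and time parts are separated by one space.
pattern = (
    r"(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-(?P<year>[0-9][0-9][0-9][0-9])"
    r" "
    r"(?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])"
)

m = re.match(pattern, "28-Jun-2006 14:05:09")
by_name = m.group("min")    # lookup by group name
by_number = m.group(5)      # the same group is also number 5

# The named form of the doubled-word pattern.
doubled = re.search(r"(?P<word>\b\w+)\s+(?P=word)", "Paris in the the spring")

print(by_name, by_number)     # 05 05
print(doubled.group("word"))  # the
```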

Lookahead Assertions

Another zero-width assertion is the lookahead assertion. Lookahead assertions are available in both positive and negative form, and look like this:

(?=...): Positive lookahead assertion. This succeeds if the contained regular expression, represented here by ..., successfully matches at the current location, and fails otherwise. But, once the contained expression has been tried, the matching engine doesn't advance at all; the rest of the pattern is tried right where the assertion started.
(?!...): Negative lookahead assertion. This is the opposite of the positive assertion; it succeeds if the contained expression doesn't match at the current position in the string.

An example will help make this concrete by demonstrating a case where a lookahead is useful. Consider a simple pattern to match a filename and split it apart into a base name and an extension, separated by a ".". For example, in "news.rc", "news" is the base name, and "rc" is the filename's extension.

The pattern to match this is quite simple:

.*[.].*$

Notice that the "." needs to be treated specially because it's a metacharacter; I've put it inside a character class. Also notice the trailing $; this is added to ensure that all the rest of the string must be included in the extension. This regular expression matches "foo.bar" and "autoexec.bat" and "sendmail.cf" and "printers.conf".

Now, consider complicating the problem a bit; what if you want to match filenames where the extension is not "bat"? Some incorrect attempts:

.*[.][^b].*$

The first attempt above tries to exclude "bat" by requiring that the first character of the extension is not a "b". This is wrong, because the pattern also doesn't match "foo.bar".

.*[.]([^b]..|.[^a].|..[^t])$

The expression gets messier when you try to patch up the first solution by requiring one of the following cases to match: the first character of the extension isn't "b"; the second character isn't "a"; or the third character isn't "t". This accepts "foo.bar" and rejects "autoexec.bat", but it requires a three-letter extension and won't accept a filename with a two-letter extension such as "sendmail.cf". We'll complicate the pattern again in an effort to fix it.

.*[.]([^b].?.?|.[^a]?.?|..?[^t]?)$

In the third attempt, the second and third letters are all made optional in order to allow matching extensions shorter than three characters, such as "sendmail.cf".

The pattern's getting really complicated now, which makes it hard to read and understand. Worse, if the problem changes and you want to exclude both "bat" and "exe" as extensions, the pattern would get even more complicated and confusing.

A negative lookahead cuts through all this:

.*[.](?!bat$).*$

The lookahead means: if the expression bat$ doesn't match at this point, try the rest of the pattern; if bat$ does match, the whole pattern will fail. The trailing $ is required to ensure that something like "sample.batch", where the extension only starts with "bat", will be allowed.

Excluding another filename extension is now easy; simply add it as an alternative inside the assertion. The following pattern excludes filenames that end in either "bat" or "exe":

.*[.](?!bat$|exe$).*$
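
The final lookahead pattern can be checked against the filenames used in this section. The helper allowed() and the name "setup.exe" are additions for this sketch, and only names with a single dot are tested.

```python
import re

PATTERN = r".*[.](?!bat$|exe$).*$"

def allowed(name):
    """True if `name` has an extension other than 'bat' or 'exe'."""
    return re.match(PATTERN, name) is not None

results = {
    name: allowed(name)
    for name in ("foo.bar", "sendmail.cf", "sample.batch", "autoexec.bat", "setup.exe")
}

print(results["foo.bar"])       # True
print(results["sendmail.cf"])   # True
print(results["sample.batch"])  # True
print(results["autoexec.bat"])  # False
print(results["setup.exe"])     # False
```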

Regex Flags

Regex flags let you modify some aspects of how regular expressions work. Below is a table of the available flags with detailed explanations. Some of the flags which deal with interpretation of the newline character do not apply to the PFrank pattern specifiers since only one-line patterns are supported. These flags are prefixed with an asterisk in the descriptions and are only included for completeness.

For example, here's a regex that can be used when the verbose flag is set:


[0-9]+[^0-9] | x[0-9a-fA-F]+[^0-9a-fA-F]   # Decimal or Hexadecimal format

Without the verbose setting, the regex would look like this:


[0-9]+[^0-9]|x[0-9a-fA-F]+[^0-9a-fA-F]
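
That the two spellings are equivalent can be checked in Python by switching the verbose flag on with an inline (?x) group. This sketch covers only the standard verbose flag; the sample strings are invented.

```python
import re

# Whitespace and the trailing comment are ignored because of (?x).
verbose = re.compile(
    r"(?x)[0-9]+[^0-9] | x[0-9a-fA-F]+[^0-9a-fA-F]   # Decimal or Hexadecimal format"
)
compact = re.compile(r"[0-9]+[^0-9]|x[0-9a-fA-F]+[^0-9a-fA-F]")

def matched(regex, text):
    """Return the matched text, or None if there is no match at the start."""
    m = regex.match(text)
    return m.group(0) if m else None

samples = ["123;", "x1F;", "abc"]
verbose_results = [matched(verbose, s) for s in samples]
compact_results = [matched(compact, s) for s in samples]

print(verbose_results)  # ['123;', 'x1F;', None]
print(compact_results)  # ['123;', 'x1F;', None]
```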

Flag	Meaning
`i`	IGNORECASE: Perform case-insensitive matching; character class and literal strings will match letters by ignoring case. For example, [A-Z] will match lowercase letters, too, and Spam will match "`Spam`", "`spam`", or "`spAM`". This lowercasing doesn't take the current locale into account; it will if you also set the `LOCALE` flag.
`L`	LOCALE: Make \w, \W, \b, and \B, dependent on the current locale. Locales are a feature of the C library intended to help in writing programs that take account of language differences. For example, if you're processing French text, you'd want to be able to write \w+ to match words, but \w only matches the character class [A-Za-z]; it won't match "`é`" or "`ç`". If your system is configured properly and a French locale is selected, certain C functions will tell the program that "`é`" should also be considered a letter. Setting the `LOCALE` flag when compiling a regular expression will cause the resulting compiled object to use these C functions for \w; this is slower, but also enables \w+ to match French words as you'd expect.
`u`	UNICODE: Make \w, \W, \b, \B, dependent on the unicode locale.
`x`	VERBOSE: Enable verbose regex's, which can be organized more cleanly and understandably. This flag allows you to write regular expressions that are more readable by granting you more flexibility in how you can format them. When this flag has been specified, whitespace within the regex pattern is ignored, except when the whitespace is in a character class or preceded by an unescaped backslash; this lets you organize and indent the regex more clearly. It also enables you to put comments within a regex that will be ignored by the engine; comments are marked by a "`#`" that's neither in a character class or preceded by an unescaped backslash.
`B`	BUILD PATH: This is a PFrank extension. It is used to make the renamer allow the use of the forward slash character in the replace pattern. When renaming occurs, the new name is treated as a path to a file. Folders are created from the current folder down to the leaf filename. e.g. a new name of 2006/06/28/picture01 will result in the folder '2006' created under the current folder, the folder '06' created under '2006', the folder '28' created under '06' and the file picture01 located in the '28' folder. Once the flag is used, it will apply to all remaining columns in the custom renaming list. The use of the flag will be disabled when Subversion (SVN) mode is enabled.
`E`	EXCLUDE EXTENSION: This is a PFrank extension. It is used to make the renamer ignore the extension of a filename.
`Z`	POST PROCESSING: This is a PFrank extension. It is used to delay the renaming expression until after the duplication checking (assuming duplication checking is enabled).
`1-4`	MATCH COUNT: This is a PFrank extension. It is used to make the renamer make up to 'n' replacements in a name, where 'n' = 1, 2, 3, or 4. Only one digit at a time can be used in the flag group.

"(?iLuxEBZ2)" is an example of the extension notation used to indicate which of the flags "i", "L", "u", "x", "B", "E", "Z", or "1-4" are to be set. (The '2' flag is shown in the example as an instance of the 'MATCH COUNT' flag.) The flag group itself matches the empty string.

Example usage:

(?xi)\.mp3 # matches the extension in mp3 files
is used to ignore letter cases when matching ".mp3" extensions.

(?x1)a # matches the first "a" in the names "abba" or "aardvark"
is used to ignore all 'a's after the first one.

(?2i)e
is used to match the first 2 'e's in a name like "Ed Meets the Beatles".

(?E)3
is used to match the number '3' unless it appears in an extension like ".mp3" or ".m3a".

(?3x1)The
will result in an error since more than 1 digit is present in the flag specifier.

Note that the (?x) flag group changes how the expression is parsed. It must be the first set of characters in the expression string. If there are any characters before the flag group, then the flags will not work. Only one flag group can be used in an expression! If there is more than one group (e.g. (?i)The(?3)(?1)(?E)), then the results are undefined.
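
The digit, "B", "E" and "Z" flags are PFrank extensions and are not understood by Python's own re module. As a rough analogy only, standard Python limits the number of replacements with the count argument of re.sub. The sketch below shows that mechanism; it is an assumption that the MATCH COUNT flag behaves similarly, and "*" is an arbitrary replacement chosen for the illustration.

```python
import re

# Analogous to (?x1)a: replace only the first "a".
first_a = re.sub(r"a", "A", "aardvark", count=1)

# Analogous to (?2i)e: replace the first 2 "e"s, ignoring case.
first_two_e = re.sub(r"(?i)e", "*", "Ed Meets the Beatles", count=2)

print(first_a)      # Aardvark
print(first_two_e)  # *d M*ets the Beatles
```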

String Replacement

Once a search pattern is specified, one can then specify a replacement pattern which is used to replace any matching strings found. Below are some examples:

Search Pattern	Matching String Before Replacement	Replace Pattern	String after Replacements
abc	abcdefgabcabc	xyz	xyzdefgxyzxyz
(blue|white|red)	blue socks and red shoes	colour	colour socks and colour shoes
x*	abxd	-	-a-b-d-
abc*yz	abcccccyz	wx	wx

If the replacement pattern is a string, any backslash escapes in it are processed. That is, "\t" is converted to a tab character, "\r" is converted to a carriage return, and so forth. Unknown escapes such as "\j" are left alone. Backreferences, such as "\6", are replaced with the substring matched by the corresponding group in the re. This lets you incorporate portions of the original text in the resulting replacement string.
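
Replacement behaviour can be explored with Python's re.sub. The sketch below sticks to cases that behave the same across Python versions: it leaves out patterns that can match the empty string (such as x*) and unknown escapes (such as "\j"), because newer Python releases handle both differently from older ones. The last two sample strings are invented.

```python
import re

# Every non-overlapping match is replaced by default.
all_replaced = re.sub(r"abc", "xyz", "abcdefgabcabc")
colours = re.sub(r"(blue|white|red)", "colour", "blue socks and red shoes")

# The optional count argument limits the number of replacements.
first_only = re.sub(r"(blue|white|red)", "colour", "blue socks and red shoes", count=1)

# A backslash escape in the replacement is processed: \t becomes a tab.
tabbed = re.sub(r"_", r"\t", "a_b")

# A backreference copies part of the matched text into the result.
swapped = re.sub(r"(\w+)-(\w+)", r"\2-\1", "left-right")

print(all_replaced)  # xyzdefgxyzxyz
print(colours)       # colour socks and colour shoes
print(first_only)    # colour socks and red shoes
print(swapped)       # right-left
```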

This example matches the word "section" followed by a string enclosed in "{", "}", and changes "section" to "subsection". The search pattern uses the verbose flag (?x) so that the spaces inside it, which are there only for readability, are ignored:

Search Pattern	Matching String Before Replacement	Replace Pattern	String after Replacements
(?x) section{ ( [^}]* ) }	section{First} section{second}	subsection{\1}	subsection{First} subsection{second}

When performing string replacements, the syntax for backreferencing to named groups (as defined by the (?P<name>...) syntax) is as follows. "\g<name>" uses the substring matched by the group named "name", and "\g<number>" uses the corresponding group number. "\g<2>" is therefore equivalent to "\2", but isn't ambiguous in a replacement string such as "\g<2>0". ("\20" would be interpreted as a reference to group 20, not a reference to group 2 followed by the literal character "0".) The following substitutions are all equivalent, but use all three variations of the replacement string. As above, the search patterns use the verbose flag (?x) so that the spaces inside them are ignored.
Note that PFrank only supports group names with a length of one character in the replace pattern.

Search Pattern	Matching String Before Replacement	Replace Pattern	String after Replacements
(?x) section{ (?P<name> [^}]* ) }	section{First}	subsection{\1}	subsection{First}
(?x) section{ (?P<name> [^}]* ) }	section{First}	subsection{\g<1>}	subsection{First}
(?x) section{ (?P<name> [^}]* ) }	section{First}	subsection{\g<name>}	subsection{First}
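
The three replacement variants can be compared in Python. The pattern is compiled with re.VERBOSE because it contains spaces that are meant to be ignored. The multi-character group name "name" works in Python; per the note above it may not be accepted in a PFrank replace pattern. The (a)(b) pattern is invented to show the "\g<2>0" case.

```python
import re

pattern = re.compile(r"section{ (?P<name> [^}]* ) }", re.VERBOSE)

variants = [r"subsection{\1}", r"subsection{\g<1>}", r"subsection{\g<name>}"]
results = [pattern.sub(replacement, "section{First}") for replacement in variants]

# \g<2>0 means group 2 followed by a literal "0", not group 20.
disambiguated = re.sub(r"(a)(b)", r"\g<2>0", "ab")

print(results)        # ['subsection{First}', 'subsection{First}', 'subsection{First}']
print(disambiguated)  # b0
```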

More Info

For complete information on Python-style regular expressions see the Python Library Reference:
http://www.python.org/doc/2.3/lib/re-syntax.html/.

or visit the web sites listed on the PFrank Web Site at:
http://www3.telus.net/pfrank/PFrankRegexLinks.html.